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Abstract 

Phylogeographic methods aim to infer migration trends and the history of sampled 
lineages from genetic data. Applications of phylogeography are broad, and in the 
context of pathogens include the reconstruction of transmission histories and the origin 
and emergence of outbreaks. Phylogeographic inference based on bottom-up population 
genetics models is computationally expensive, and as a result faster alternatives based 
on the evolution of discrete traits have become popular. In this paper, we show that 
inference of migration rates and root locations based on discrete trait models is 
extremely unreliable and sensitive to biased sampling. To address this problem, we 
introduce BASTA (BAyesian STructured coalescent Approximation), a new approach 
implemented in BEAST2 that combines the accuracy of methods based on the 
structured coalescent with the computational efficiency required to handle more than 
just few populations. We illustrate the potentially severe implications of poor model 
choice for phylogeographic analyses by investigating the zoonotic transmission of Ebola 
virus. Whereas the structured coalescent analysis correctly infers that successive human 
Ebola outbreaks have been seeded by a large unsampled non-human reservoir 
population, the discrete trait analysis implausibly concludes that undetected 
human-to-human transmission has allowed the virus to persist over the past four 
decades. As genomics takes on an increasingly prominent role informing the control and 
prevention of infectious diseases, it will be vital that phylogeographic inference provides 
robust insights into transmission history. 


Author Summary 

When studying infectious diseases it is often important to understand how germs spread 
from location-to-location, person-to-person, or even one part of the body to another. 
Using phylogeographic methods, it is possible to recover the history of spread of 
pathogens (or other organisms) by studying their genetic material. Here we compare 
different phylogeographic methods based on principled population models and fast 
alternatives. We found that different approaches can give diametrically opposed results, 
and we offer a concrete example in the context of the ongoing Ebola outbreak in West 
Africa. We found that the most popular phylogeographic method often produces 
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completely inaccurate conclusions. One of the reasons of this popularity has been its 
computational speed, which has allowed users to analyse large genetic datasets with 
complex models. More accurate approaches have until now been considerably slower, 
and therefore we propose a new method called BASTA that achieves good accuracy in a 
reasonable time. We are relying more and more on genetic sequencing to learn about 
the origin and spread of infections, and as this role continues to grow, it will be essential 
to use accurate phylogeographic methods when designing policies to prevent or kerb the 
spread of disease. 


Introduction 


Phylogeographic methods aim to infer many aspects of population evolution from 
genetic data. The phylogeography term often encompasses methods to infer changes in 
population sizes (phylodynamics) and population divergence events (see [I]). In this 
work, we focus on the inference of migration between distinct subpopulations (such as in 
the structured coalescent, see D- For many years, nested clade phylogeographic 
analysis (NCPA, see e.g. [3j[4]) was the leading method to test isolation and migration 
(reviewed in 01 ). More recently, model-based inference for phylogeography has 
flourished and has replaced NCPA as the new standard approach (reviewed in © 0 )- 
Phylogeographic model-based approaches for migration have predominantly been 
used to study the spread of pathogens through geographic locations and to infer their 
original source (8 11 


but is also commonly used to study the migration history of 


animals 12-14 , plants 15 16 , and even languages 17 . Phylogeographic methods can 


also be useful to address a wider range of questions in epidemiology, for example when 
studying transmission of pathogens between body compartments in the same host 
between individual hosts 


21 


19 , between host social groups 20 , or between host 


18 


species 

The first class of model-based approaches that we consider are likelihood-based 
methods implementing the structured coalescent 22-26 . These approaches use the 


structured coalescent without approximation to infer migration rates and effective 
population sizes. These methods are not practical in scenarios with large numbers of 
populations and migration events due to their computational demand. In fact, they not 
only explore the space of parameters of primary interest (such as migration rates, 
population sizes, and phylogeny) but also the space of all possible migration histories, 
vastly increasing the computational complexity. 

Recently, an alternative phylogeographic approach has been proposed, which models 
the evolution of locations as a discrete trait, in the same way as evolution of genetic loci 
is usually modeled [8[[9} |14| . Since migration is modelled similarly to genetic mutations, 
this model is referred to as “Mugration” by [27,[28], or more commonly as “discrete trait 
analysis” (DTA in the following). This approximation has become very popular (see 
e.g. [l]) thanks also to its computational efficiency and user-friendly software. Yet, this 
model is based on a set of approximations that profoundly diverge from classical models 


of migration in population genetics (see e.g. 29 31]). While methods based on the 
structured coalescent, which accounts for the effects of migration on the genealogy, are 
preferable over DTA, the latter is often chosen due to the computational demands of 
current implementations of the structured coalescent. DTA is also commonly used to 
describe the evolution of discrete phenotypes. In such cases, DTA may be 

However, this requires some assumptions that are usually not met 
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appropriate 

when the studied trait is a geographic location, including that trait frequencies in the 
global population can drift and reach loss or fixation and that trait sampling is random, 
i.e., that the number of samples carrying each trait is not pre-determined, but is 
obtained by sampling randomly from a single population containing all traits. 
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There is a scarcity of studies in the scientific literature assessing the accuracy of 
DTA and comparing different phylogeographic approaches, but concerns have been 
raised because, among other other issues, DTA is thought to be sensitive to the 
sampling locations chosen 11 34 . Also, because DTA conceptually separates the 


coalescent and the migration process, it might lead to non-optimal use of information. 
Here we demonstrate that these concerns are well founded, in that DTA suffers from 
various biases and statistical inefficiency despite its efficient approximation. 

To address this problem we introduce a new model-based approach that uses a close 
approximation to the structured coalescent (similar in spirit to [35 36]). The idea 
behind this approximation is to efficiently integrate over the parameter space of all 
possible migration histories, therefore reducing the computational effort needed to 
explore the space of parameters of interest. We implement this approach in the 
Bayesian phylogenetic package BEAST2 28 , and call it BASTA (BAyesian STructured 


coalescent Approximation). We compare its performance to DTA and MultiTypeTree 
(MTT, a recent Bayesian structured coalescent software, see [26]) using simulations 
based on the structured coalescent. 

To demonstrate the importance of model choice, we analysed genomic data from 
previous and ongoing Ebola outbreaks 37 using different phylogeographic approaches, 


and studied the contribution of zoonotic events to Ebola contagion. We show that, 
based both on simulations and real data analyses, DTA and structured coalescent 
methods can lead to different conclusions, and in particular DTA is often inaccurate. 


Materials and Methods 


The Structured Coalescent 


Using the notation of 26 , we introduce the structured coalescent model in a Bayesian 


inference setting. Data from a given set of samples / consist of three elements: 

S = {si|* £ 1} the aligned genetic sequences, ti = {ti\i £ 1} the sampling dates, and 
L = {li\i £ /} the sampling locations. We are interested in estimating parameters 
describing the migration and coalescent processes: m the matrix of migration rates 
between demes and 6 the vector of effective population sizes for the considered demes. 
Lastly, the model has additional “nuisance” parameters: T the coalescent tree, /x the 
substitution rate matrix, and M the migration history of lineages in the tree. We want 
to estimate the posterior distribution of the parameters of interest using Bayes’ theorem: 


P(T, M, fi, m, 0\S, tj, L) oc P(S\T, U, fx)P(T, M\ti, L, m, 8)P(fj,, m, 6). (1) 

The first term P(S\T, ti, fx) can be computed using Felsenstein’s pruning 
algorithm [38]. The second term represents the probability of the migration and 
coalescent history under the structured coalescent model given the population genetic 
parameters (the migration rates m and population sizes 6). The last term, P(/r.,m, 0) 
represents the prior distribution of the parameters, and may be factored to 
P{n)P{rn)P{Q) assuming independence of /x, m, and 0. 

We now briefly describe how P(T, M|tj, L, m, 6) is calculated within the structured 
coalescent. We assume that we have a set of unique demes D , where each deme d £ D 
has an effective population size 9d- We further assume that rridd 1 is the 
backward-in-time instantaneous migration rate from population d to population d'. We 
consider the sequence of time intervals 7), * £ 1... B of length i y between subsequent 
events (coalescent, sampling, or migration), starting from the most recent sample and 
going back to the root of the phylogeny. The contribution to the likelihood of interval i 
is, assuming haploidy: 
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Li = exp 


( 2 ) 
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where ki^ d is the number of lineages in deme d in interval i, and Ei is the contribution 
of the event that ends interval i: 


1 

m dd > 

1 

0~d 

The full likelihood is obtained by taking the product of Eq. © over all intervals. 

In the structured coalescent migration events affecting lineages in the tree are 
explicitly parameterized and estimated (Fig. [lj. The next two methods presented, DTA 
and BASTA, avoid this, therefore reducing statistical complexity and computational 
demand. 

Figure 1. Graphical representation of phylogeographic models. In this study 
we consider three phylogeographic methods: the structured coalescent, DTA, and 
BASTA. This figure shows some of the differences in these models, in particular in the 
modelled events and time intervals. Coloured dots show different populations (one red 
population and one green) for both sampled and internal phylogenetic nodes, a) In the 
structured coalescent eight events are considered, delimiting seven time intervals of 
lengths 7~i... 7y. Three of these events are sampling events (denoted by the grey 
horizontal lines), one is a migration event (represented by an arrow between two 
coloured dots), and four are coalescence events, b) In DTA, migration events are not 
explicitly parameterised, so we have a total of seven sampling or coalescence events, 
delimiting six time intervals of lengths r-| ... r@. While in the figure locations for 
internal nodes are depicted, the method effectively integrates over all possible ancestral 
locations at each MCMC step, c) As in DTA, BASTA does not consider migration 
events, and therefore has seven events and six time intervals. Yet, each of these intervals 
is split exactly in half (blue horizontal dotted lines), and the two halves are considered 
separately. Again, as in DTA, at each MCMC step BASTA integrates over all possible 
internal nodes locations. 



if it is a sampling event 
if it is a migration event from d to d! 

if it is a coalescence event in deme d. 


( 3 ) 


Discrete Trait Analysis 

Here we give a short introduction to the Bayesian implementation of the discrete trait 
analysis |8,14|, (DTA). Similarly to the structured coalescent, the focus is on the 
inference of migration parameters, but there are some noticeable differences: all demes 
have the same effective population size; the effects of migration on the coalescent 
process are ignored, and a standard coalescent prior is used for the coalescent tree. As 
mentioned above, migration is modelled as if it were genetic mutation. In summary, 
instead of Eq. <[T[) , the DTA model adopts the following approximation: 


P(T, /x, m, 0\S, ij, L) m P{S\T, t I} n)P{L\T , h, m)P(T|t 7 , 0)P(/x, m, G) (4) 

where again P(/x, m 1 9) may be factored to P(/x)P(m)P(0), both likelihoods 
P(S|T, f/,/x) and P(L\T,ti,m) are calculated with the pruning algorithm, and 
P(T\t d ,G) is simply a neutral coalescent prior for an unstructured population. The 
consequences of this approximation have not been thoroughly explored in the literature, 
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despite the popularity of DTA. One concern is that the effects of migration rates on the 
shape of the phylogeny are ignored. For example, when migration rates are very low, 
long branches close to the root are expected. This type of information is ignored by 
DTA, and this could lead to reduced accuracy. 

Furthermore, sampling locations L are treated as data (with likelihood 
P(L\T,ti,m)), similarly to genetic alignment, while in the structured coalescent they 
are conditioned upon as auxiliary variables in the tree prior P(T, M\ti, L , m, 0), in a 
similar manner to sample size. Sampling locations are usually arbitrarily chosen by the 
investigator, and are not a consequence of the migration process, as assumed in DTA. 
This could upwardly bias estimated migration rates towards over-sampled populations, 
and we explore this eventuality using simulations. An effect of sampling intensity over 
migration rates and ancestral locations estimated using DTA has already been shortly 
reported elsewhere 


11 34 , but has never been explored in detail. 


DTA is also commonly used to describe the evolution of discrete phenotypes (in this 
sense, it is more appropriately referred to as “discrete trait analysis”). This model has 
not to be considered an approximation in such cases, and we do not refer to those cases 
here, but we only consider when DTA are used to model migration. 


BASTA 

We pursue an approximation to the Bayesian structured coalescent that is both accurate 
and computationally efficient. Similarly to Eq[T] we define: 

P(T, fi, m, 0\S , tj, L) oc P(S\T, f/, (j,)P(T\tj, L, m, 0)P(fi , m, 6). (5) 

where P(S|T, fj, fi) and P(^t,m, 6) are calculated as in the structured coalescent. 
However, we marginalise over M, and instead of calculating the joint likelihood of the 
coalescent and migration history P(T, M\tj, L , m, 0), we approximate the likelihood of 
the coalescent history alone P(T\tj, L,m,0) by integrating over all possible migration 
histories M. This reduces the parameter space at the cost of introducing an 
approximation similar to 35.3(i). 

As in the structured coalescent, we consider a sequence of events backward in time, 
but now they can only be sampling or coalescence events (Fig. [l]). These events define a 
sequence of time intervals 4*6 l... B of lengths n between successive events.We 
describe here the contribution to the likelihood of any single interval /,; = [aj_i,aj] 
(with \di-\ — oii\ = Tj, and at being the older event time of I z and a^-i the more recent 
one). The total likelihood is then obtained by multiplying the contribution from each 
interval. The likelihood of the interval p, that we want to approximate, is: 


Li = exp 


EE E 

deD leA I'eAj'jti 


P(Z ed,l'e d\t)—dt 


Ei 


( 6 ) 


where A is the set of all extant lineages during p, and P(l £ d,l' £ d\t) is the 
probability that lineages l and l' are in these same deme d at time t (we are implicitly 
conditioning on the more recent coalescent history). Ei is the contribution of the 
coalescent or sampling event (defined later). Eq. © is different from Eq. © in that we 
are considering intervals delimited only by sampling or coalescence events, and not by 
migration events. In fact, Eq. ([©) integrates over all possible migration histories. 

The first approximation we do is to replace P(l £ d,l' £ d\t) by P(l £ d\t)P(l' £ d\t), 
that is, modeling the migration of lineages as independent from one another. We will 
call Pi t the vector whose element d is Pi,t,d := P(l £ d\t). Our further approximation is 
to split /,; into two sub-intervals of equal length In = [a,_i, (a* + ai- 1 )/ 2 ] and 
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Ii 2 = \{oii + aj_i)/2,aj], such that Pi d is approximated as Pi, ai _ 1 for all t in In and 
Pi,ai for all t in p 2 . Then the approximated likelihood contribution of In becomes: 


L a = exp 


n 

2 


y ] y ] y ] Pl,ou-i,dPl' 

d£D le A 


1 

Od 


(7) 


and similarly is defined La. 

Pi t is a row vector of dimension D , and if l is sampled at time t , Pi t is an indicator 
vector representing the deme of sampling. Given Pi, a ,_, we can calculate P; jQi as: 


Pi, ai = Pi, ai -! exp (t % ■ m) 


( 8 ) 


where time is scaled in N e = generations, the exponential is a matrix 

exponential, and m is the matrix of instantaneous backward migration rates with 
diagonal elements set such that rows sum up to 0. If lineages p and p coalesce to an 
ancestral lineage l at time i, then 


Pi,t = 


Ph,t,iPl 2 ,t,i 

di 


Ph,t,\D\Pl 2 ,t,\D\ 

0\D\ 


|D| Mi ,t,d-^l2,t,d 


(9) 


2 ^ d =1 


0d 


which is the normalised entrywise product (element by element product) of the 
distributions of the coalescing lineages, where Pi } t,d is entry d of vector Pi tt - 
Lastly, the contribution to the likelihood of the event at time at is: 


Ei = 


1 

Thd£D Pl,a it dPl',Oii,d~7r 


if it is a sampling event, 
if it is a coalescence event of l and V. 


( 10 ) 


The contribution to the likelihood of sampling events might in principle be modified to 


account for an informative sampling process 39 . Details of how we efficiently calculate 
these quantities, in particular Eq. ([ 7 ]), are given in Supplementary Text SI. 


Simulations 


We performed simulations under the structured coalescent 29 40 , simulating a neutral 


coalescent of samples from different locations (also called demes or populations). 
Lineages can only coalesce while in the same location, and migration events happen 
according to a pre-specified backward-in-time instantaneous migration rate matrix. So 
to fit the assumptions of DTA, all populations have the same effective population size 
(N d = 1), and effective population size parameters are not estimated. 

In the first simulation setting (called “continents”) we have two locations, with 
lineages in the same location coalescing freely. The migration rate matrix was different 
for each replicate, each time being sampled from the prior used in DTA: the total 
forward migration rate was sampled from an exponential distribution with mean 
m £ {0.5, 2.0, 5.0}; The two forward migration rates (that from location 1 to 2, 77112 , 
and that from two to 1, 77121 ) were sampled independently from r(1.0,1.0) distributions. 
The forward migration rates then were scaled such that the total migration rate (as 

defined in DTA: - ——was m. Lastly, the instantaneous backward migration 

mi 2 + m 2 i 

rates used to simulate the coalescent process were defined as TO 12 = 77121 and 

77i2i = 777 . 12 - This favours DTA by matching its prior on forward migration rates, but 
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not the one of MTT or BASTA, where backward migration rates priors are log-normal 
distributions with /i = ln(m) and a = 4. 

In every simulation the structured coalescent process was simulated starting from 
the leaves (all samples are contemporaneous) and iteratively sampling waiting times for 
events, each of which is either a coalescence or a migration, according to the 


distribution exp(^) 


d£D 


) where rid is the number of extant 


( 2 ) + nd J2d'^d m ' dd ' 

lineages in population d. This process was repeated until the root is reached. 

In a second simulation setting (called “archipelagos”) we have eight locations 
subdivided in two groups, each containing four locations. Each location represents an 
island, while each of the two groups is an archipelago, so that migration rate within an 
archipelago is high, while it is low between archipelagos. All migration rates between 
locations in the same group are identical (mi) as are all migration rates between 
locations in different groups (m 2 ). We fixed mi/rri 2 = 10, and sampled m 2 from an 
exponential distribution with mean 0.5. 

From all simulations, migration rates and root location were then estimated using 
DTA 8 14 , MTT and BASTA, all as implemented in BEAST2. Genetic/phylogenetic 


information provided in input to the three methods was generated from the simulated 
coalescent histories in 3 different ways: in the first setting (“fixed tree”), we assume 
abundant genomic information, such that the coalescent tree is perfectly known and 
only the total tree height scale in coalescent units is unknown, which was achieved by 
fixing the phylogeny in BEAST2 to the simulated one, and only estimating the total 
tree height; in a second strategy (“no data”), we assume that genetic data is extremely 
scarce, and provided no genetic/phylogenetic information at all (this is not realistic, but 
is useful to learn about prior biases of different models); lastly, in the third strategy 
(“variable tree”) we provided an alignment of 2000 bp generated from the simulated 


coalescent tree using SeqGen 41 with a transition/transversion ratio of k = 3 and 


mutation rate of 0.01 in units of N e generations, while we estimated the phylogenetic 
tree in BEAST2 along with the phylogeographic parameters. 


Ebola Transmission Study 

To study changes of host type in Ebola we used whole genome Ebola sequences from 78 


patients recently obtained and aligned with sequences from previous outbreaks 37 


The authors of this study investigated the phylogenetic relationship of samples within or 
between Ebola outbreaks. Here, we apply the three phylogeographic methods presented 
above to infer the contribution of zoonotic events to Ebola spread. We used the same 
alignment provided in 37 for the BEAST analysis, including sampling dates, but we 


also added information regarding host type. All samples were specified to be from 
human hosts, but despite this, we allowed lineages to switch to an animal reservoir 
during their history. Migration was only allowed from animal to human host, and not 
vice-versa. So in our phylogeographic model we had two locations (respectively human 
and animal reservoir) but migration was only assumed to occur in one direction. This 
results in a structured coalescent model with three parameters for MTT and BASTA 
(one migration rate and two effective population sizes), but only two parameters for 
DTA, as only a single general effective population size can be defined in that model. A 
peculiarity of this analysis is that only genetic samples from one of the two considered 
populations were available. While this might seem an impassable limitation, previous 
studies have shown that the structured coalescent provides meaningful estimates even in 
the absence of samples from one populations (i.e. “ghost derne”, see 42 ), suggesting 


that it is possible to perform statistical inference of zoonosis rates from this dataset. 
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Results 


DTA is Inherently Biased by the Sampling Process 


The introduction of approximation can lead to biases in parameter estimation. To test 
for the presence of biases associated with sampling strategy in DTA, MTT and BASTA, 
we performed simulations paired with estimation in the scenario where the genetic data 
was completely uninformative. In the absence of informative genetic data, such as in our 
“no data” scenario, the posterior distribution of a well calibrated Bayesian approach 
should coincide with the prior distribution, and we want to test if this is the case here. 
One of the differences between DTA and the structured coalescent is that the sampling 
locations in the structured coalescent are auxiliary variables which are always 
conditioned over, while in DTA they are treated as part of the data. Usually, sampling 
locations should not be treated in the same way as genetic data because they may be 
quite arbitrary. 

Simulations show that the posterior inference of DTA depends heavily on the 
distribution of sampling locations, unlike the two structured coalescent methods. 
Particularly with high migration rates (prior mean 5.0) DTA posteriors show large 
biases (Fig. UK), meaning that the sampling strategy chosen significantly influences the 
inference. The posterior distributions of the other two methods are noticeably less 
smooth (Fig. [2|3 and C), reflecting the larger computational demand of MTT and 
BASTA. When migration rates are low (prior mean 0.1) DTA over-estimates them 
(Figure SI 4). This is because sampling from two locations implies the presence of at 


least one migration event. However, under the DTA model there is a high prior 
probability of no migration when migration rates are sufficiently low. In contrast, the 
structured coalescent accounts for the fact that there must be at least D — 1 migration 
events when D locations are sampled. 


Figure 2. DTA is inherently biased by the sampling process. To show this, we 
considered the case of a dataset with totally uninformative sequences (this mimics 
extremely short or uniform sequences). DTA treats the sampling process as informative 
about migration parameters, unlike the structured coalescent. This can be seen by the 
perturbation of the posterior relative to the prior distribution, as inferred by (a) DTA, 
(b) MTT, (c) BASTA. We used high migration rates, with prior mean 5.0, and two 
sampling strategies: homogeneous (100 samples per locations, azure) and 
inhomogeneous (respectively 10 and 190 samples per location, green). In pink is the 
prior distribution. On X axis is the ratio of the two migration rates, on Y axis the 
density of the corresponding value in the distribution. Each plot is obtained from 10 
merged posteriors of independent MCMC runs each of 5 x 10 6 iterations. 


DTA Under-represents Uncertainty 

Next we addressed the accuracy of the methods when dealing with highly informative 
sequences. At the opposite extreme to before, methods are expected to perform best 
when genetic information is so informative that the phylogenetic tree can be estimated 
with great detail. When genetic data are extremely informative, for example in some 
cases when whole genomes are available, it may be convenient to assume that the 
phylogeny is known. Here we consider this scenario by providing the true tree topology 
and relative branch lengths as input, and estimating only the tree height together with 
the migration rate parameters. 

The absolute migration rates have different scales under DTA and the structured 
coalescent, so we focus on estimation of the relative migration rate, the ratio of the two 
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considered migration rates. DTA exhibits generally poor performance (Fig. [3] and 
Figure S2j). The 95% confidence intervals are not well calibrated (they include the truth 


56%-81% of the time, compared to 80%-96% for MTT and 84%-97% for BASTA; 
Table [l]), and the correlation between simulated and estimated values is lower 
(0.33-0.64) than competing methods (0.51-0.85 for BASTA and 0.42-0.77 for MTT; 
Table [T]) . Estimation of root location shows similar trends (Fig. [4] and Figure S3). 
Earlier structured coalescent methods had increased computational demands with 
elevated migration rates due to the increased parameter space needed to represent 
migration histories 24 . Here we found that MTT performed similarly well with 


different total migration rates, supporting the view that its new proposal functions 


represent a very considerable improvement over previous efforts 26 


We also compared the performance of the different approaches at intermediate levels 
of information, when there is both phylogenetic signal and phylogenetic uncertainty (the 
“variable tree” scenario). This scenario is probably the most common in practice, but 
also the most complex as phylogenetic uncertainty makes inference more 
computationally demanding. All three methods account for phylogenetic uncertainty by 
exploring the space of possible trees with MCMC. To investigate scenarios with 
phylogenetic uncertainty, we simulated alignments of 2000 bp with a mutation rate of 
0.01 per N e generation. We simulated 50 replicates and a single scenario (prior total 
migration rate of 2.0, and 50 samples per population). All methods reported greater 
uncertainty in this setting, as expected, with DTA showing weaker correlation between 
point estimates and the truth and severely underestimating posterior uncertainty 
compared to BASTA. While MTT most faithfully captures posterior uncertainty, it 
shows the worst correlation between point estimates and the truth, possibly reflecting 


its greater computational demands in the presence of phylogenetic uncertainty (Figure 
S4 and Table [2|. 


Figure 3. DTA retrieves partial information and is not calibrated. We 

simulated the scenario of complete phylogenetic information, that is, when the 
phylogenetic tree is perfectly known to each of the methods. In this scenario methods 
are expected to give the best accuracy. We plot the estimates of the ratio of the two 
migration rates between the two simulated populations using (a) DTA, (b) MTT, (c) 
BASTA. Furthermore, for each method we show the Pearson correlation between the 
true value and the posterior median (“Correlation”), and the proportion of replicates in 
which the true value lies in the 95% confidence interval (“Calibration”). We simulated 
high migration rates (prior mean 5.0), homogeneous sampling (100 samples per 
locations), and 100 replicates. The simulated (true) ratio of the two migration rates is 
on the X-axis, while the estimated ratio is on the Y-axis. The diagonal dashed line 
represents the hypothetical perfect estimate. Each dot represents a posterior median, 
and intervals show the posterior 95% coverage. Number of MCMC steps for DTA, MTT 
and BASTA are respectively 10 6 ,2x 10 5 and 10 5 so to achieve similar running times 
(respectively approximately 180, 200 and 150 seconds per replicate). 


BASTA is Faster than Structured Coalescent Methods 

So far, we have considered scenarios with only two populations, for which structured 
coalescent models are expected to work well. Yet, with more populations, structured 
coalescent methods can be too computationally demanding for practical inference. To 
test the performance of BASTA in such situations, and compare it to MTT, we 
simulated a scenario with eight populations and fixed the tree to facilitate inference 
under the structured coalescent. We subdivided the populations into two archipelagoes 
(clusters of islands), each of four islands, and with 40 samples from each island. 
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Table 1. Summaries of simulations with two populations in the “fixed tree” 
scenario. 


Sampling 

Rate 

Method 

Calibration 

Correlation 

RMSE 

Homogeneous 

Fast 

DTA 

0.56 

0.58 

1.83 



MTT 

0.87 

0.77 

1.32 



BASTA 

0.95 

0.83 

1.51 

Homogeneous 

Slow 

DTA 

0.81 

0.64 

1.65 



MTT 

0.96 

0.75 

1.52 



BASTA 

0.97 

0.81 

1.30 

Inhomogeneous 

Fast 

DTA 

0.68 

0.33 

1.79 



MTT 

0.80 

0.46 

2.50 



BASTA 

0.84 

0.70 

2.08 

Inhomogeneous 

Slow 

DTA 

0.80 

0.39 

1.73 



MTT 

0.85 

0.42 

2.49 



BASTA 

0.88 

0.51 

2.29 


For each scenario 100 replicates were performed. For each summary the value of interest 
is the logarithm of the ratio of the two migration rates. “Sampling” refers to the 
sampling strategy adopted: “Homogeneous” means that 100 samples per populations 
were used, “Inhomogeneous” means that we used 10 samples for one population and 190 
for the other. “Rate” refers to the total migration rate: “Fast” represents a prior mean 
of 5.0, while “Slow” a prior mean of 0.5. “Calibration” represents the proportion of 
replicates for which the true value falls within the 95% posterior interval. “Correlation” 
represent the Pearson correlation between the simulated value and the posterior median. 
“RMSE” represent the root mean square error of the posterior median. 


Table 2. Summaries of simulations with two populations and variable trees. 


Method 

Calibration 

Correlation 

RMSE 

DTA 

0.70 

0.51 

1.68 

MTT 

0.92 

0.49 

2.56 

BASTA 

0.86 

0.56 

2.61 


The datasets are simulated under moderate migration rates (prior mean 2.0), 
homogenous sampling (50 samples per location), with a total of 50 replicates. For each 
summary the value of interest is the logarithm of the ratio of the two migration rates. 
“Calibration” represents the proportion of replicates for which the true value falls within 
the 95% posterior interval. “Correlation” represent the Pearson correlation between the 
simulated value and the posterior median. “RMSE” represent the root mean square 
error of the posterior median. 


Migration between islands in the same archipelago was fast (mean 5.0) while migration 
between archipelagoes was 10-fold lower. Both methods reported considerable 
uncertainty in their estimates of the migration rates and root location (Figure S6). 
However, BASTA reached convergence and acceptable effective sample sizes in 
reasonable time (2 x 10 6 MCMC steps for a total of approximately 1.3 x 10 4 seconds 
per chain); instead, with similar computational effort, MTT was far from convergence 
(see e.g. Fig. [5] and Figure S6 for some randomly sampled replicates) with most 
parameters having an effective sample size (ESS, measuring how well the posterior has 
been explored, see [43]) below 20, way below the generally accepted limit of 200. These 
results show that not only does BASTA produce a modest but consistent improvement 
in calibration and informativeness over MTT (see also Table [3] but has also broader 
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Figure 4. Structure coalescent improves ancestral location inference. We 

measured the accuracy in inferring the location of phylogeny roots for different methods: 
(a) DTA, (b) MTT, (c) BASTA. Each bar represents the posterior support for the true 
root location for a replicate, so taller bars represent better inference. Simulations were 
performed with two locations, fixed trees, high migration rates (prior mean 5.0), and 
homogeneous sampling (100 samples per locations). For every scenario there are 100 
replicates, but only replicates for which the simulated root is in the first location are 
shown. Bars are in increasing order. Number of MCMC steps for DTA, MTT and 
BASTA are respectively 10 6 , 2 x 10 5 and 10 5 so to achieve similar running times 
(respectively approximately 180, 200 and 150 seconds per replicate). 


applicability to scenarios with more populations. 

Figure 5. BASTA has broader applicability than MTT. When simulating a 
scenario with eight populations, BASTA always seemed to efficiently explore the 
parameter space in acceptable time, while MTT, with comparable computational 
resources, never achieved convergence. With these plots we show the traces of the 
posterior probability (Y axis) over the MCMC steps (X axis) in one random replicate. 
Similar plots for further replicates are found in |Figure S6| For these simulations we used 
fixed trees. 


Table 3. BASTA improves rate estimation. 


Method 

Calibration 

Correlation 

RMSE 

within archipelago 

MTT 

0.815 

0.62 

1.49 

BASTA 

0.95 

0.69 

1.33 

between archipelagos 

MTT 

0.98 

0.61 

1.57 

BASTA 

1.00 

0.67 

1.47 


To compare migration rate estimation between MTT and BASTA in a setting with 
many populations, we simulated a scenario with two groups of populations (two 
archipelagos) each containing four populations (four islands) and with 40 samples per 
population. 50 replicates were simulated. The values of interest considered here are the 
logarithm of the migration rate between islands in the same archipelago (top part of the 
table) and between islands in different archipelagos (bottom part of the table). 
“Calibration” represents the proportion of rates among all replicates for which the true 
value falls within the 95% posterior interval. “Correlation” represent the Pearson 
correlation between the simulated value and the posterior median. “RMSE” represent 
the root mean square error of the posterior median. 


Model Choice Strongly Influences Reconstruction of Ebola 
Transmission Dynamics 


While we write, the most deadly known outbreak of Ebola virus is ongoing in West 
Africa. In recent work, Gire et al. 37 have collected and whole genome sequenced 99 


Ebola virus samples from 78 patients. Using these and previous data, the authors have 
shown that all available sequences within each outbreak since 1976 cluster together 
phylogenetically; furthermore, divergence of lineages leading to different outbreaks 
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usually considerably pre-dates the older outbreak. This fact and the shape of the 
inferred phylogeny suggest that of independent zoonotic transmissions are the source of 
different Ebola outbreaks in humans. Ebola infections in different animals have been 


directly observed more than 50 times, with bats thought to be the main reservoir 44 


We addressed this subject in order to explore the potential impact of modelling 
considerations on epidemiological conclusions based on genetic data. We defined a 
highly simplified phylogeographic model with two subpopulations: the first representing 
human hosts, the second representing an animal reservoir. In this model, coalescence 
events within the human population represent human-to-human transmission; similarly 
coalescence events in the animal reservoir represent transmission between animal hosts. 
Migration from the animal reservoir to the human population corresponds to a zoonotic 
transmission. Migration from human to animal was assumed sufficiently rare to be 
ignored (see [44|). 

Using this phylogeographic model, we investigated the effect of model choice - DTA 
versus structured coalescent - on the epidemiological conclusions concerning the role of 
zoonotic transmission in seeding human outbreaks of Ebola. We found that the two 
models gave diametrically opposed results. 

Consistent with general understanding of the emergence of Ebola outbreaks in 
humans, the structured coalescent models, implemented in MTT, inferred that each 
outbreak was seeded by an independent zoonosis from the Ebola reservoir population 
(Fig. [6^), with 18 animal-human zoonoses inferred in the history of the sample (95% Cl 
[15.0, 22.0]), from 1976 until present. In keeping with this, the effective size of the 
animal reservoir was inferred to be very large (95% Cl [7.3, 22.9]) compared to the 
effective size of the virus population sustained in humans (95% Cl [0.27,0.60] ). The 
most recent common ancestor of all sampled human outbreaks was inferred to have 
originated in the animal reservoir population with 100% posterior probability. These 
results were also supported by BASTA. 

In direct contrast, the DTA painted a very different picture of Ebola outbreak 
emergence that does not accord with scientific understanding. With high confidence, no 
zoonotic transmissions from animals to humans were inferred in the history of the 
sampled outbreaks (100% posterior probability), with the most recent common ancestor 
inferred to have occurred in the human population. Despite the implausibility of 
undetected human outbreaks having sustained Ebola virus in humans over four decades, 
the discrete trait model supported this scenario with high confidence. 

These results illustrate the strong influence of model choice on phylogeographic 
inference. They demonstrate the possibility of obtaining implausible results supported 
by high confidence with DTA. Although in the case of Ebola, the strength of evidence 
concerning the epidemiology of the disease is more than sufficient to disregard the 
discrete trait analysis out of hand, it demonstrates the potential to produce highly 
misleading inference when independent epidemiological understanding is scarce. 


Figure 6. Different histories are inferred by different methods. The trees 
were inferred using (a) DTA and (b) MTT. In red we show regions of the phylogeny 
inferred to be in human host, while in blue are regions inferred to be in animal 
reservoirs. The scale of the axis is in number of years from present. Sampling dates and 
locations are included in sample names. Part (a) shows the maximum clade credibility 
tree with median node height inferred with DTA, while part (b) displays a tree sampled 
from MTT output. 
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Discussion 


Phylogeography has rapidly gained prominence in a wide range of settings where it can 
quantify historical patterns of migration from genetic data alone. In the context of 
infectious disease epidemics, phylogeographic methods can infer transmission rates and 
patterns of spread even in the complete absence of reliable epidemiological information 
(see e.g. [8-10 45]). Yet, these methods have only partially been tested and compared. 
Here, with simulations based on explicit process-driven population genetics-based 
models, we showed that different methods exhibit dramatic differences in their inference 
properties. 

While discrete trait analysis (DTA) is extremely fast and accounts for phylogenetic 
uncertainty, it has problems estimating the correct migration rates even with as few as 
two populations. In particular, DTA is sensitive to the relative sampling intensity of 
populations, such that the sampling strategy adopted can influence the results, 
particularly when migration rates are high and genetic data are sparse. We reiterate 
that here we have assessed the performance of DTA as a model of migration, and not in 
the context of evolution of discrete traits, such as some phenotypes, for which the 
discrete trait analysis is appropriate. MTT, on the other hand, proved better calibrated, 
with less biased estimates, with a stronger correlation between simulated and estimated 
values. 

Together with other methods based on structured coalescent, MTT also has the 
advantage over DTA of being able to estimate and account for differences in population 
sizes between populations. Also, it provides estimates of absolute migration rates that 
are meaningful from a population genetics perspective. We want to acknowledge that 
MTT proved useful even in contexts of elevated migration rates, where previous 
structured coalescent-based methods showed convergence problems. Yet, we also show 
that when several demes are considered (we simulated eight demes, but 26 suggest not 
to exceed four) MTT can have convergence issues. To deal with this problem, we 
propose a new approach, BASTA, based on an approximation of the structured 
coalescent similar to some recently proposed 35 36 . BASTA integrates over all the 


possible migration histories rather than sampling them, therefore considerably reducing 
the parameter space that needs to be explored. Not only did this approach show 
appreciable improvements in accuracy with respect to MTT with just two populations, 
but it was possible to obtain MCMC convergence with BASTA for as many as eight 
populations, which was beyond the reach of MTT in feasible time (3-4 hours). 

Finally, using real data from Ebola outbreaks in humans we showed that the choice 
of model is very important, because different models can lead in practice to completely 
different results. In fact, diametrically opposite phylogeographic patterns were 
estimated using DTA rather than structured coalescent-based methods. MTT and 
BASTA gave instead comparable estimates. We recommend that users exercise caution, 
and we point out that methods based on the structured coalescent are in general more 
reliable, although also more computationally demanding. The fact that all three 
approaches considered here are all implemented in the same phylogenetic package 
(BEAST2) is a considerable advantage, as it is possible to run and compare different 
methods while installing a single piece of software and using similar formats. 

In the future, we will work on extending BASTA by including estimates of ancestral 
locations of internal nodes other than the root, and estimates of expected numbers of 
migrations between populations. 
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Supplementary Text SI: Computational Details of 
BASTA - Eq. 0 

Calculating Eqs. [8] and [9] requires similar steps to Felsenstein’s pruning algorithm, and 
also has similar computational demands. We therefore do not focus on its details here. 
Instead we show how we calculate the coalescent rates (Eq. 0). and in particular, the 
sum 

EE E (u) 

deD leA d 

for a given time t, a given set of extant lineages A, and given the probabilities P d t . For 
brevity, from now on we ignore the time index t. If the expected number of lineages in a 
deme d is represented as E(n<i) := E/p a ^ i - we have: 


EE E p ? p ?l = 

dGD IgA a 


E 

deD 





Ey- nn d )nnd)-Y, p i p i 

d£D Ud L le A 


( 12 ) 


Let us call Sd = EjgA P i P i- Calculating Sd requires 0(|A|) time and is needed for 
each deme and twice for each coalescent event. So, if n denotes the number of samples, 
the total cost of computing Sd for the whole tree is approximately 0(n 2 ■ |D|). 
Updating Sd after a coalescent or sampling event is trivial and negligible in time. 
Calculating E(n d ) is also faster, as we can use the same procedure in Eq. ([8]) which 
avoids the sum over lineages, giving a required time of « 0(n ■ |-D| 2 ). Updating E(rid) 
after coalescence and sampling events is trivial and fast. Calculating the exponential of 
the migration rate matrix used in Eq. ([8]) is required once per event, for a total 
computational cost < 0(n ■ |U| 3 ). Lastly, while Eq. ([9]) requires negligible time, Eq. Q 
has computational cost ~ 0(\D\ 2 ), that repeated over all lineages and over all events, 
brings to a total cost of « 0(n 2 ■ |U| 2 ), which is the computational bottleneck of 
BASTA (generally n » \D\). 

To further reduce the computational time, we adopt a caching technique that 
consists in using the same vectors for lineages that have undergone the same history 
since sampling (including same sampling location). If many leaves are sampled at the 
same time, this leads to important savings, but in the worst scenario the total 
computational demand remains « 0{n 2 ■ |-D| 2 ). 
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Figure 7. Figure 1 
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Figure 9. Figure 3 
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Figure 12. Figure 6 
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Figure 14. Figure S2 
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Figure 16. Figure S4 
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Figure 17. Figure S5 
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Figure 18. Figure S6 


MultiTypeTree 


BaStA 



1 (b) 


Calibration: 95% 
Correlation: 0.69 


0.5 1.0 2.0 5.0 10.0 20.0 

True migration rate 



24/ 24 





















































































